home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Turnbull China Bikeride
/
Turnbull China Bikeride - Disc 2.iso
/
BARNET
/
PROGRAMMING
/
DOCS
/
TOPTIPS
< prev
next >
Wrap
Text File
|
1998-06-02
|
12KB
|
413 lines
<html>
<head> <title> Top Tips on programming in ARM Assembler </title> </head>
<body>
This page demonstrates some of the optimisations which are possible when
programming in ARM assembler. It's based on my experience of optimising
other people's code, so it's real examples of tricks that people overlook.
This page assumes a familiarity with the ARM instruction set, and
programming in assembler in general.
<h1> Use of conditionals </h1>
Conditional execution of instructions is one of the best things about the
ARM instruction set, and yet people overlook it so often. For example,
returning with the V bit set from a function is often used to indicate
an error occurred:
<pre>
STMFD r13!, {r14}
BL func1
LDMVSFD r13!, {pc}
ADD r0, r1, r2
BL func2
LDMVSFD r13!, {pc}
SUB r1, r2, r0
BL func3
LDMVSFD r13!, {pc}
LDMFD r13!, {pc}^
</pre>
can be rewritten as:
<pre>
STMFD r13!, {r14}
BL func1
ADDVC r0, r1, r2
BLVC func2
SUBVC r1, r2, r0
BLVC func3
LDMVCFS r13!, {pc}^
LDMFD r13!, {pc}
</pre>
A saving of 8 bytes in space, and 3 instructions in execution time for the
case when no errors occur - which is hopefully the more common! I also
find it more readable and it's easier to trace the flow of execution.
Naturally, if you have to do comparisons then this is not possible and
you'll have to either exit or branch over the conditional segment of code.
It can be more efficient to do this if you have a long procedure exit
sequence, but normally exiting is merely a matter of unstacking the
registers and writing the appropriate value to the program counter which
is one instruction.
<h1> Use of the correct comparison </h1>
The different types of comparison available often confuse the novice
programmer. Use of signed comparisons instead of unsigned is the most
common error which is made, which can be merely inefficient (and
<a href="#final">here's an example</a>), but often leads to bugs when
memory ranges are suddenly greater than 2GB.
<pre>
Unsigned Signed
HS (CS) GE
HI GT
LS LE
LO (CC) LT
</pre>
PL and MI directly test bit 31 of the result (ie the N flag is a copy of
bit 31). They are subtly different to GE and LT and I'd like someone to
give me a case where they should be used instead of GE or LT.
<h1> Don't misuse the stack </h1>
Obviously, don't stack registers that aren't required to be preserved -
which these are depends on whether you are writing APCS conforming code,
or whether you have your own Procedure Call Standard, or whether you work
on an ad hoc basis. But it's not necessary to stack r14 if the function
is a leaf function (calls no other functions). However, if you do find
yourself in need of more registers than you have available, then r14
should be the first register you stack on the grounds of efficiency.
Consider the two functions:
<pre>
STMFD r13!, {r14}
MUL r0, r1, r2
LDMFD r13!, {pc}
STMFD r13!, {r1, r2}
LDR r1, [r12, #8]
LDR r2, [r12, #12]
MUL r0, r1, r2
LDMFD r13!, {r1, r2}
MOVS pc, r14
</pre>
Better would be:
<pre>
MUL r0, r1, r2
MOVS pc, r14
STMFD r13!, {r14}
LDR r14, [r12, #8]
LDR r0, [r12, #12]
MUL r0, r14, r0
LDMFD r13!, {pc}
</pre>
In the second function, restoring registers is combined with returning
from the procedure call. Since r14 must be corrupted by the function
call, the caller is not expecting its value to be any particular value at
the end of the function. This has been exploited in at least one program
which I know of to return function values in r14, but you really need
to be aware of exactly what you're doing with that technique, and I've
never dared try it myself.
<p>
A further optimisation which can be used is to use a single store rather
than a multiple store if you're only storing r14. The above code then
becomes:
<pre>
STR r14, [r13, #-4]!
LDR r14, [r12, #8]
LDR r0, [r12, #12]
MUL r0, r14, r0
LDR pc, [r13], #4
</pre>
Beware that you can't tell LDR to restore the program counter flags.
This is not an issue if you are using 32-bit APCS-3, but most routines
on 26-bit ARMs preserve the flags over a function call.
<p>
It can sometimes be a win to only stack registers conditionally. Here's
an example:
<pre>
LDR r2, [r12, #0]
TEQ r0, r2
MOVEQ pc, r14
STMFD r13!, {r14}
BL func
SUB r0, r0, #1
LDMFD r13!, {pc}
</pre>
This can be more efficient if r0 is often equal to r2; for example if this
were an error check.
<p>
Another way in which the stack can be abused is to store data _below_ the
stack pointer. This will appear to work but you will get random crashes.
This is because interrupt code is entitled to use space on the stack
temporarily; if you have left the stack pointer in the wrong place, you
will have problems.
<h1> References to data within a big program </h1>
When your program gets to more than 4k, you start to run the risk of
not being able to address data. This is because LDR has a maximum
range of +/-4095 bytes. ADR (which is a pseudoinstruction for
ADD rN, pc, x, which takes into account the fact that
the program counter is 2 instructions ahead of the currently executing
instruction) uses the normal 8 bits rotated by a multiple of 2 scheme,
so this can get you further away but at the cost of not being able to
directly access every address. Many people have FNadrl macros and at
least two assemblers that I know of have ADRL instructions. They are
implemented as something like:
<pre>
ADR rN, (offset AND &FF)
ADD rN, ((offset - P%)AND &FF00)
</pre>
with slight modifications to cope with a negative offset, and the cunning
ones actually try various possibilities such as
<pre>
ADR rN, (offset AND &3FC)
ADD rN, ((offset - P%) AND &3FC00)
</pre>
to gain them more distance.
<p>
Nevertheless, these pseudo-instructions take 2 instructions, and ADRX
(for constants which can't be reached with ADRL) takes 3. It is better
to avoid using these if possible. Try moving data around so that it's
nearer the functions which reference it. If that's not possible, try
moving functions around.
<p>
If you have data which are used throughout the program, consider
allocating a register to point to their address throughout the execution
of the program. In RISC OS modules, this is assisted by Acorn by pointing
r12 at a block of private workspace.
<p>
Under no circumstances consider doing the following:
<pre>
BL getvar
...
.getvar
LDR r0, var
MOVS pc, r14
.var
EQUD 0
</pre>
since this involves two pipeline flushes and wrecks most caching
strategies so it is guaranteed to be slower. It also goes against the
principle of keeping data and code separate which is more important on
StrongARM processors with their split instruction/data caches.
<h1> Working with large constants </h1>
By large constants, I don't mean constants with a large magnitude like 2^31,
I mean constants which don't fit in the ARM's scheme for immediate constants;
ie 8 bits rotated by a multiple of 2. This example is from IscaFS:
<pre>
LDR r0, [r6, #24]
BIC r0, r0, #&ff000000
BIC r0, r0, #&00ff0000
MOV r1, r1, #&53
ORR r1, r1, #&ef00
TEQ r0, r1
</pre>
I replaced this with:
<pre>
LDR r0, [r6, #24]
MOV r0, r0, LSL #16
EOR r0, r0, #&ef000000
TEQ r0, #&00530000
</pre>
Shifting r0 left by 16 automatically zeroes the leftmost 16 bits
and discards the previous top 16 bits, which is the desired effect.
The crucial step is noticing that TEQ is the same as EORS, without a
destination register. So if the top byte of r0 is not &ef then the
second test can never be true. This trick saves 2 instructions and
one register.
<p> A similar technique can apply to other situations, for example
checking that a 16 bit value is less than a given value can be done
by
<pre>
CMP r0, #&xy00
CMPEQ r0, #&00za
</pre>
is the same as
<pre>
MOV r1, #&xy00
ORR r1, #&00za
CMP r0, r1
</pre>
but takes one instruction fewer and uses one register fewer.
<p>
If it's utterly unavoidable, you may need to put a large constant into a
register. It is thought that there are no constants which cannot be
produced in 3 instructions, though no-one has an algorithm for producing
arbitrary constants (to the best of my knowledge). On a StrongARM, it
is normally quicker to load an instruction instead of using 3 to synthesise
it. On an ARM6, it is quicker to use 3 instructions. Take your pick.
<h1> Strength reduction </h1>
This term covers a number of optimisations. One is to reduce the cost of
per-iteration calculations.
<pre>
MOV r0, #200
MOV r4, #8
.loop
MUL r1, r0, r4
...
SUBS r0, r0, #1
BNE loop
</pre>
becomes
<pre>
MOV r1, #1600
.loop
...
SUBS r1, r1, #8
BNE loop
</pre>
This saves a MUL instruction in the loop, which is a slow operation on many
variants of the ARM. In this case, it also saves 2 registers, though in
practice this is not often achieved.
<h1> Count down, not up </h1>
The above example also illustrates that it's better to count down in a loop
than up. Here's a worse example:
<pre>
MOV r1, #0
.loop
...
ADD r1, r1, #8
CMP r1, #1600
BNE loop
</pre>
The SUBS used above combines the test-for-end with the loop-count, saving
an instruction.
<h1> Unrolling loops </h1>
Since the instructions used per-loop are not doing useful work, it is often
better to unroll the loop a little. This can add extra overhead in places,
so you should be cautious. It may also pay to unroll the loop entirely, as
the C compiler may sometimes do with memset. To reuse the example above:
<pre>
MOV r1, #1600
.loop
...
SUB r1, r1, #8
...
SUBS r1, r1, #8
BNE loop
</pre>
In this example, since I haven't specified what is in `...', it's not
possible to combine the first SUB in with it which might be possible in
real code. This optimistaion saves half an instruction (and a cache line flush) per loop
- 800 in total. Against that, it takes more cache lines, and more RAM in
general. The only way to find if this is a win or not is to benchmark the
code in question.
<h1> Dividing </h1>
There are already several good algorithms out there and I
don't think I have anything to contribute myself. Please see <a
href="http://www.comlab.ox.ac.uk/oucl/users/robin.watts/Docs/Arc/Programmin/div-fast.html">
This page</a> and <a
href="http://www.comlab.ox.ac.uk/oucl/users/robin.watts/Docs/Arc/Programmin/div-short.html">
this one</a> for good examples.
<h1> Some further examples </h1>
That's about all the coding tricks I can remember for the moment. If you
want to see a real example of code which I've improved, try looking at
IscaFS which was originally written by Phil Norman.
<p>
I'll leave you with some more fictional examples:
<a name="final"></a>
<pre>
STMFD r13!, {r14}
CMP r0, #5
BGT not
CMP r0, #0
BLT not
LDR r0, [r12, #8]
B over
.not
MOV r0, #0
SUB r0, r0, #1
.over
LDMFD r13!, {pc}
</pre>
Can be optimised to:
<pre>
CMP r0, #5
LDRLS r0, [r12, #8]
MVNHI r0, #0
MOV pc, r14
</pre>
This example demonstrates several points:
<ol>
<li> Use of unsigned instead of signed comparisons
<li> Use of conditionals to avoid branches
<li> Not storing the link register if avoidable
<li> Remembering one of the more `esoteric' instructions
</ol>
Thanks for reading, I appreciate feedback and if you have any tricks you'd
like to share with the ARM programming community at large then please send
them to me.
<p>
You probably want to look at these pages for further tips:
<a href="http://www.comlab.ox.ac.uk/oucl/users/robin.watts/Docs/">Robin
Watts' ARM programming page</a>
<p>
I'd like to thank Phil Norman and Peter Burwood for their help.
<hr>
<i><a href="mailto:willy@bofh.ai">Matthew Wilcox</a></i>
</body>
</html>